ABSTRACT

Four criterion-referenced reliability coefficients
were compared to the Kuder-Richardson estimates and to each other.
The Kuder-Richardson formulas 20 and 21, the Livingston, the
Subkoviak, and two Huynh coefficients were computed for a random
sample of 33 criterion-referenced tests. The Subkoviak coefficient
yielded the highest mean value; Huynh's Kappa yielded the lowest. The
two Huynh coefficients were highly positively correlated with the
Kuder-Richardson 20 and 21 coefficients, and with each other; the
Livingston and the Subkoviak indexes were highly correlated with each
other. A two-factor principal components analysis suggested that the
Subkoviak coefficient measured a test characteristic that differed
from the classical internal-consistency coefficients. (Author/CTM)
ABSTRACT

The purpose of this study was to compare several criterion-referenced
reliability coefficients to the Kuder-Richardson estimates and to each other.
The Kuder-Richardson formulas 20 and 21, the Livingston, the Subkoviak, and
two Huynh coefficients (K, k) were computed for a random sample of 33
criterion-referenced tests. The Subkoviak coefficient yielded the highest
mean value; Huynh's Kappa yielded the lowest. The Huynh K and k coefficients
were highly positively correlated with the Kuder-Richardson 20 and 21
coefficients, and with each other; the Livingston and the Subkoviak indexes
were highly correlated with each other. A two-factor principal components
solution suggested that only the Subkoviak coefficient measured a test
characteristic that differed from the classical (KR) internal-consistency
coefficients.
INTRODUCTION

The reliability of a criterion-referenced test can be estimated by
several different methods, many derived from differing theoretical frameworks
and assumptions. Popham and Husek (1969) and Hambleton and Novick (1973)
pointed out that classical estimates of test reliability may be inadequate
for tests designed for criterion-referenced interpretations or mastery
decisions. Livingston (1972) proposed a criterion-referenced reliability
coefficient derived from a redefinition of classical measurement theory in
terms of observed-score deviations from the criterion score. The Livingston
coefficient subsequently drew criticism from some researchers (Harris, 1971;
Hambleton and Novick, 1973). Other researchers (Hambleton and Novick, 1973;
Swaminathan, Hambleton, and Algina, 1974) have proposed two-administration
consistency-of-mastery-decision indexes as appropriate reliability coefficients
for criterion-referenced tests. Still others have presented single-administration
indexes of consistency-of-decision reliability (Marshall and Haertel, 1975;
Subkoviak, 1977; Huynh, 1977).

Single Administration Reliability Coefficients

The Kuder-Richardson Formula 20 reliability coefficient (Kuder and
Richardson, 1937) is given by:

\[
r_{20} = \frac{K}{K-1}\left(1 - \frac{\sum pq}{S^2}\right) \tag{1}
\]

where:

$K$ = number of test items
$pq$ = item variance
$S^2$ = test variance

The Kuder-Richardson Formula 21 reliability coefficient (Kuder and
Richardson, 1937) is given by:

\[
r_{21} = \frac{K}{K-1}\left(1 - \frac{\bar{X}(K - \bar{X})}{K S^2}\right) \tag{2}
\]

where:

$\bar{X}$ = test mean
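The two Kuder-Richardson formulas can be illustrated with a minimal sketch. It assumes right/wrong (0/1) item scoring and population variances (divisor N); neither choice is stated in the text, and the function names and the small `demo` matrix are hypothetical, for illustration only.

```python
import numpy as np

def kr20(items: np.ndarray) -> float:
    """KR-20 from an examinee-by-item matrix of 0/1 scores (assumed scoring)."""
    k = items.shape[1]                    # K: number of test items
    p = items.mean(axis=0)                # proportion correct for each item
    sum_pq = np.sum(p * (1.0 - p))        # sum of item variances pq
    s2 = items.sum(axis=1).var()          # S^2: variance of total scores (divisor N)
    return float((k / (k - 1)) * (1.0 - sum_pq / s2))

def kr21(items: np.ndarray) -> float:
    """KR-21 uses only the number of items, the test mean, and the test variance."""
    k = items.shape[1]
    totals = items.sum(axis=1)
    mean = totals.mean()                  # X-bar: test mean
    s2 = totals.var()                     # S^2: test variance (divisor N)
    return float((k / (k - 1)) * (1.0 - mean * (k - mean) / (k * s2)))

# Hypothetical data: 5 examinees by 4 items.
demo = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
])
print(round(kr20(demo), 4), round(kr21(demo), 4))  # 0.8 0.6667
```

Because KR-21 treats all items as equally difficult, it cannot exceed KR-20 on the same data, which agrees with the lower KR21 mean reported for the 33 tests.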



Livingston (1972) derived a criterion-referenced reliability
coefficient as a correction of classical reliability:

\[
r_{cc} = \frac{r_{xx} S^2 + (\bar{X} - C)^2}{S^2 + (\bar{X} - C)^2} \tag{3}
\]

where:

$r_{xx}$ = classical test reliability
$C$ = criterion score

It should be noted that if $\bar{X} = C$, $r_{cc}$ will, of course, be equal
to $r_{xx}$.
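The behavior of equation (3) as the criterion score moves away from the test mean can be checked with a short sketch. The function name and the numeric inputs are hypothetical; only the formula itself comes from the text.

```python
def livingston(r_xx: float, s2: float, mean: float, c: float) -> float:
    """Livingston coefficient from equation (3).

    r_xx : classical reliability estimate (for example KR-20)
    s2   : test variance S^2
    mean : test mean X-bar
    c    : criterion (cutoff) score C
    """
    d2 = (mean - c) ** 2                  # squared distance of the mean from the criterion
    return (r_xx * s2 + d2) / (s2 + d2)

# Hypothetical values: with the criterion at the mean the classical value is returned;
# moving the criterion away from the mean raises the coefficient.
at_mean = livingston(0.8, 2.0, 2.0, 2.0)
off_mean = livingston(0.8, 2.0, 2.0, 3.0)
print(at_mean, off_mean)
```

For any classical reliability below 1, the coefficient increases with the distance between the test mean and the criterion score, so it is never smaller than the classical estimate it corrects.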

The Subkoviak group coefficient of agreement (Subkoviak, 1977) is the
mean of the individual coefficients of agreement ($p_c^{(i)}$):

\[
P_c = \frac{\sum_i p_c^{(i)}}{N} \tag{4}
\]

where:

$p_c^{(i)}$ = the coefficient of agreement for person $i$
$N$ = number of persons tested
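The text gives only the averaging step in equation (4), not how each individual coefficient is obtained. The sketch below is therefore an assumption, not the authors' program: it takes each person's number-correct score to be binomial, estimates the person's proportion-correct true score with a regressed (Kelley-type) estimate, and defines the individual coefficient as the probability of the same mastery decision on two independent administrations. All function and parameter names are hypothetical.

```python
from math import comb

def prob_at_or_above(c: int, k: int, p: float) -> float:
    """P(X >= c) for X ~ Binomial(k, p)."""
    return sum(comb(k, x) * p**x * (1 - p) ** (k - x) for x in range(c, k + 1))

def subkoviak_pc(scores, k: int, c: int, reliability: float) -> float:
    """Group coefficient of agreement: mean of the individual coefficients.

    scores      : observed number-correct scores, one per person
    k           : number of test items
    c           : mastery cutoff (scores >= c count as mastery)
    reliability : classical reliability used to regress scores toward the mean (assumed)
    """
    n = len(scores)
    mean = sum(scores) / n
    total = 0.0
    for x in scores:
        # Assumed regressed estimate of the person's proportion-correct true score.
        p_hat = reliability * (x / k) + (1 - reliability) * (mean / k)
        m = prob_at_or_above(c, k, p_hat)
        # Same decision twice: mastery on both administrations or non-mastery on both.
        total += m * m + (1 - m) ** 2
    return total / n

# Hypothetical scores on a 10-item test with a cutoff of 7.
print(subkoviak_pc([3, 5, 6, 7, 8, 9, 10], k=10, c=7, reliability=0.8))
```

Under these assumptions each individual coefficient lies between .5 and 1, so the group value is bounded below by .5, which is worth keeping in mind when comparing its mean with coefficients that can be zero or negative.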

Huynh (1977) provides two consistency of classification indexes, kappa (K)
and an estimate of kappa (k):

\[
\kappa = \frac{P_{11} - P_1^2}{P_1 - P_1^2} \tag{5}
\]

where:

\[
P_{11} = \sum f(X, Y)
\]

and, when the number of test items $K > 10$:

\[
k = \frac{P_{00} - P_0^2}{P_0 - P_0^2}
\]

\[
P_0 = \sum f(X)
\]
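Both Huynh indexes share the same algebraic form, which a short sketch makes concrete. It reads the joint term (P_11 or P_00) as the probability of the same classification on two administrations and the marginal term (P_1 or P_0) as the probability of that classification on one; this reading, the function name, and the numeric inputs are assumptions for illustration.

```python
def huynh_kappa(p_joint: float, p_marginal: float) -> float:
    """Kappa-type index: (P_joint - P_marginal^2) / (P_marginal - P_marginal^2).

    Undefined (division by zero) when p_marginal is 0 or 1, that is,
    when everyone falls in the same category.
    """
    chance = p_marginal ** 2              # joint probability expected under independence
    return (p_joint - chance) / (p_marginal - chance)

# Hypothetical values:
perfect = huynh_kappa(0.6, 0.6)       # joint equals marginal: fully consistent decisions
chance_only = huynh_kappa(0.36, 0.6)  # joint equals marginal squared: chance-level consistency
partial = huynh_kappa(0.5, 0.6)
print(perfect, chance_only, partial)
```

Because chance agreement is subtracted in both numerator and denominator, the index can be near zero or negative even when most decisions agree, which may help explain why it has the lowest mean among the coefficients compared.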



OBJECTIVES OF THIS STUDY

This study proposed to compare empirically the six single-administration
reliability coefficients for their usefulness in criterion-referenced testing:

1. The Kuder-Richardson 20 coefficient
2. The Kuder-Richardson 21 coefficient
3. The Livingston (r_cc) coefficient
4. The Subkoviak (P_c) Group Index of Consistency
5. The Huynh Index of Consistency (K)
6. The Huynh Index of Consistency (k)



Huynh presents formulas for the estimation of P_11 and P_1 based on a
beta-binomial model.

P_00 and P_0 are evaluated by reference to univariate and bivariate normal
tables, after Gupta (1963).
Data Source

The data for this study were 33 achievement examinations; these examinations
represent a random sample of objective format (three-to-five option
multiple-choice) criterion-referenced (mastery) examinations from undergraduate
teacher education classes, medical school classes, and state-wide assessment
tests.

The number of examination items ranged from 5 to 143, with a mean of 38.9
items. The number of subjects taking these tests ranged from 5 to 1110,
with a mean of 209.9. These examinations had standard deviations from 0.98 to
12.19. Average item difficulty was .26 (proportion incorrect) with a range
of .12 to .41, while average item discrimination (D) was .24 with a range of
.14 to .47.

Methods

Each of the six reliability coefficients was computed for each examination.
The Kuder-Richardson 20 reliability coefficient (and other standard item
analysis data) was computed for these exams by standard scoring routines.
All other reliability coefficients were computed by a special computer program
written for this study. A correlational analysis was then carried out on
these data; additionally, a factor analysis was performed to determine the
uniqueness of each coefficient's contribution to the total variance.

Results

Table 1 shows that the Huynh K has the lowest absolute numerical mean
value, while Subkoviak's P_c has the highest numerical mean value. Table 2
presents the intercorrelations of these reliability coefficients.



The authors wish to acknowledge Mary Yuen for programming assistance.



Table 1

Reliability Coefficients for 33 Tests

Coefficient           Mean    S.D.    Range
KR20                  .544    .226     .195 to .923
KR21                  .350    .359    -.358 to .870
Kappa (K)             .241    .257    -.177 to .680
Kappa Estimate (k)    .322    .205     .063 to .644
r_cc                  .605    .275    -.064 to .900
P_c                   .802    .108     .580 to .975


Table 2

Correlation Matrix of Coefficients

         KR20    KR21    K       k       r_cc    P_c
KR20     1.0
KR21     .949    1.0
K        .961    .978    1.0
k        .974    .996    .993    1.0
r_cc     .693    .644    .590    .562    1.0
P_c      .393    .298    .226    .147*   .833    1.0

*NS; all others, p ≤ .10
A principal components factor solution and an orthogonally
rotated factor solution are presented in Tables 3 and 4. Factor 1 (unrotated)
accounts for 73.4 percent of the total variance while the second factor
accounts for 25.4 percent of the variance.

Table 3

Principal Factor Solution

n = 33 Exams

         Factor 1    Factor 2
KR20      .9728      -.1075
KR21      .9761      -.2026
K         .9471      -.3120
k         .9782      -.2084
r_cc      .7162       .6794
P_c       .3466       .9090














Table 4

Varimax Rotated Factor Solution

n = 33 Exams

         Factor 1    Factor 2
KR20      .9442       .2575
KR21      .9823       .1702
K         .9955       .0578
k         .9864       .1656
r_cc      .4165       .8950
P_c      -.0116       .9727
The principal components analysis suggests that the Huynh coefficients
(K and k) and the Livingston coefficient (r_cc) share a great deal in common
with the classical Kuder-Richardson internal consistency reliability
estimates, while the Subkoviak coefficient appears to be indexing a test quality
that differs from the Livingston/Huynh/Kuder-Richardson formulas.
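The kind of analysis described here can be sketched by extracting principal components directly from the correlations among the six coefficients. The matrix below is entered from the rounded values reported in Table 2; the use of NumPy, the variable names, and the sign convention are assumptions, and the output need not match the reported variance percentages or loadings exactly, since the original program and its options are not described.

```python
import numpy as np

labels = ["KR20", "KR21", "K", "k", "r_cc", "P_c"]

# Lower triangle of the correlation matrix, as reported (rounded) in Table 2.
lower = [
    [1.0],
    [.949, 1.0],
    [.961, .978, 1.0],
    [.974, .996, .993, 1.0],
    [.693, .644, .590, .562, 1.0],
    [.393, .298, .226, .147, .833, 1.0],
]
R = np.zeros((6, 6))
for i, row in enumerate(lower):
    for j, r in enumerate(row):
        R[i, j] = R[j, i] = r             # fill both halves of the symmetric matrix

# Eigen-decomposition of the symmetric matrix; eigh returns ascending eigenvalues.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]         # reorder from largest to smallest
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Share of total variance per component (the total equals the number of variables).
proportion = eigvals / eigvals.sum()

# Unrotated loadings on the first two components: eigenvector times sqrt(eigenvalue).
loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])
loadings *= np.sign(loadings.sum(axis=0))  # eigenvector signs are arbitrary; fix a convention

print("proportion of variance:", np.round(proportion[:2], 3))
for name, (f1, f2) in zip(labels, loadings):
    print(f"{name:5s} {f1:7.4f} {f2:7.4f}")
```

A varimax rotation of these two columns would be the next step toward a rotated solution; it is omitted here to keep the sketch short.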

Discussion 

Since all coefficients except the Subkoviak load on the (unrotated)
internal consistency factor in this study, it appears that for
criterion-referenced tests like those in this sample, it would make little
difference whether one uses a classical reliability estimate, the Huynh indexes,
or the Livingston coefficient to assess overall examination quality. All of
these coefficients appear to be indexing very similar test qualities and one,
therefore, has a basis for arguing that the classical coefficients are as
appropriate for criterion-referenced tests as the Huynh or Livingston
criterion-referenced indexes.

The Livingston coefficient, r_cc, derived directly from classical test
theory, loads on both factors and loads highest in the Varimax solution on
the Subkoviak factor. This suggests that the Livingston coefficient may be
intermediate to the classical and the criterion-referenced coefficients. This
result also suggests that the Livingston r_cc may be more useful for
criterion-referenced reliability than its critics have allowed.

The Subkoviak coefficient does seem to be indexing a test attribute
different from the other coefficients considered. Therefore, it may be
necessary and desirable to compute both an internal consistency reliability
estimate (or K) and the Subkoviak P_c or the Livingston r_cc for
criterion-referenced tests. Yet, these results do support the usefulness of the
familiar Kuder-Richardson formula 21 coefficient with criterion-referenced
examinations. This finding may encourage the criterion-referenced examination
user who does not have access to sophisticated computer facilities.

Further Research

The generalizability of the findings in this study is possibly limited
by the heterogeneity of the tests in this sample with respect to test length.
Further research is indicated, with larger samples of tests that are more
homogeneous with respect to test length. Additionally, it is important to
attempt to replicate these results for various homogeneous ranges of item
difficulties and discriminations.